Assignment 1

1.1.

Sys.setenv('MAPBOX_TOKEN' = "sk.eyJ1Ijoid291bnRhaW4iLCJhIjoiY2xtZzk5dmhiMWNubjNrcXFidzhlYmU4cyJ9.4bXukm7QheBqcyBIXPLchw")

p1 <- plot_mapbox(df1) %>% filter(YEAR == 2004) %>% add_trace(type = "scattermapbox", lat = ~Y, lon = ~X,
                                                              color = ~VECTOR) %>% layout(mapbox = list(style = "open-street-map"))

p1

Equirectangular map of mosquito detection 2004

p2 <- plot_mapbox(df1) %>% filter(YEAR == 2013) %>% add_trace(type = "scattermapbox", lat = ~Y, lon = ~X,
                                                              color = ~VECTOR) %>% layout(mapbox = list(style = "open-street-map"))

p2

Equirectangular map of mosquito detection 2013

The cylindrical projection for this map makes the far northern and southern areas inflated.

In 2004, Albopictus was generally detected in USA, Europe and SE Asia, albeit absolute majority in Taiwan. Aegypti is detected more around or below the equator. Detection in Vietnam and Indonesia, India, Kenya and Nigeria. They also occur in Texas as well as Mexico. However the most dense areas are in South America. Colombia, Venezuela in the north, but most prominently in Brazil.

In 2013, Albopictus has alot less area covered, but the areas where they exist is alot more populated. They have very high density in Taiwan, and barely no where else. The density in Taiwan has increased rapidly, but for the rest of the world it has decreased. The amount of Albopictus detected has sunk from 4324 to 1037 between 2004 and 2013. Regarding Aegypti the amount has exploded, from 1558 in 2004 to 6011 cases in 2013. However, the spread has concentrated alot to Brazil, with only a few cases outside of it. Brazil had a big spread in 2004 but relative to 2013 it’s very scarce, and detection seems to have skyrocketed throughout all of the country. What type of perception problems can be found in these plots? Taiwan is very small with the absolute highest detection rate but might be hard to see. Points can also be on top of each other and it is generally hard to draw conclusions around density.

1.2.

# World Map projection
g12 <- list(
  projection = list(type = "equirectangular")
)
df12 <- df1 %>% group_by(COUNTRY) %>% summarise(Detections = n()) # new df which summarise the amount of observations per country

p3 <- plot_geo(df12) %>% add_trace(z = ~Detections,
                                   locations = ~COUNTRY,
                                   locationmode = "country names", # identify country names as locations
                                   colors = "Blues",
                                   text = ~COUNTRY) %>% layout(geo = g12, showlegend = FALSE)
p3

Equirectangular map of mosquitos detected by country during all study period

One reason could be Taiwan being a severe outlier compared to both the other countries with high detection rates, and especially those with low. This makes the scale weird and the other countries colors are all very light as their detections pale in comparison. When countries are missing from the data, their borders aren’t included either which makes the map confusing for those cases.

1.3.

p4 <- plot_geo(df12) %>% add_trace(z = ~log(Detections),
                                   locations = ~COUNTRY,
                                   locationmode = "country names", # identify country names as locations
                                   color = ~Detections,
                                   colors = "Blues",
                                   text = ~COUNTRY) %>% layout(geo = g12, showlegend = FALSE)
p4

Equirectangular map of log-transformed mosquitos detected by country during all study period

g13 <- list(
  projection = list(type = "conic equal area")
)

p5 <- plot_geo(df12) %>% add_trace(z = ~log(Detections),
                                   locations = ~COUNTRY,
                                   locationmode = "country names", # identify country names as locations
                                   color = ~Detections,
                                   colors = "Blues",
                                   text = ~COUNTRY) %>% layout(geo = g13, showlegend = FALSE)

p5

Conic equal area map of log-transformed mosquitos detected by country during all study period

The equirectangular plot has the advantage that it’s easy to get a picture of the whole globe with ease. You can pre-attentively compare different countries all over the world. The conic equal are, aside from providing correct areal imaging, is probably easier to use if you want to compare differences between say two neighboring countries.

The difference is how your pre-attentive can work for either comparing large areas like continents or countries far from each other, or countries close to each other. The attentive mechanism is activated in the equirectangular area when you try to pinpoint an area. The attentive mechanism is activated in the conic equal area when you have to move the graph alot to compare larger areas which makes you lose track of where you are really. Another disadvantage with the cone projections makes shapes and distances incorrect.

1.4.

df14 <- df1 %>% dplyr::filter(.,COUNTRY == "Brazil" & YEAR == 2013)

## a)
df14$X1 <- cut_interval(df14$X, 100)

## b)
df14$Y1 <- cut_interval(df14$Y, 100)

## c)
df14 <- df14 %>% dplyr::group_by(X1, Y1) %>% summarise(mean_X = mean(X), mean_Y = mean(Y), N = n(), .groups = "keep")

## d)
p6 <- plot_mapbox(df14) %>% add_trace(type = "scattermapbox", 
                                      lat = ~mean_Y, lon = ~mean_X,
                                      color = ~N) %>% layout(mapbox = list(style = "open-street-map"), showlegend = FALSE)

p6

Scatterplot of mosquito detections in Brazil 2013

The regions most infested by mosquitos are Joao Pessoa and Recife along the very most eastern coast. Other than that its the area around Sao Paolo in the south-east that is more infested than other parts of the country. The discretization has helped since groups of coordinates were able to be created with structurized the country into grid points and made it easy to both count and visualize the individual cases from each area.

To the north west the density of mosquito detections is much lower than the rest of the country, this could be because of less mosquitos or under-reporting as amazons covers a big part of this land.

Assignment 2

2.1.

# changed the encoding to get the åäö in the names
data_scb <- read.csv("000006SW_20230912-130931.csv",fileEncoding = "ISO-8859-1")

data_json <- fromJSON(file="gadm41_SWE_1.json")

# splitting the age variable 

# 1
data_scb <- spread(data_scb, key=age, value=X2016)

colnames(data_scb)[2:4] <- c('Young', 'Adult', 'Senior')

knitr::kable(caption = "First 10 rows from the processed data", head(data_scb, n = 10))
First 10 rows from the processed data
region Young Adult Senior
01 Stockholm county 418.9 716.4 742.5
03 Uppsala county 327.6 589.9 631.1
04 Södermanland county 346.0 532.0 551.6
05 Östergötland county 316.0 546.4 579.4
06 Jönköping county 359.6 563.7 605.0
07 Kronoberg county 334.4 546.9 577.2
08 Kalmar county 340.4 526.9 544.8
09 Gotland county 329.7 530.9 532.9
10 Blekinge county 318.7 521.2 536.2
12 Skåne county 339.4 559.5 594.5

2.2.

# violin plot 
fig <- plot_ly(data_scb)
fig <- add_trace(fig, y=~Young,name='Young', type='violin', box=list(visible=T))
fig <- add_trace(fig, y=~Adult,name='Adult', type='violin', box=list(visible=T))
fig <- add_trace(fig, y=~Senior,name='Senior', type='violin', box=list(visible=T)) %>%
          layout(yaxis=list(title ='Mean income'))
fig 

Violin plot over mean income

When looking at the distribution for the mean income for each age-group you can see that there is a big difference between the young group and the two others, this is expected as the mean for young people should be quite low as many would still be studying and/or doing low income jobs as they have limited experience to the other groups. The highest values of mean income are for the senior group and you can see that there are more higher values for this group than the Adult group as its wider for higher values than the adult violin. The variance is lowest for the young group as the violin is shorter than the other two, each group seem to have two outliers except for the senior group that only have one, that may be because the variance for mean income is very high for this group

All violin show that the distribution of the mean income is skewed to the right, some regions have a much higher mean income than others.

The median your the young group is 334.4, the adult group have a median of 531 and the senior group have a median of 551.6, the mode for each group is close their median. None of the distributions look like some of the famous ones we know but the young group might remind us of a normal distribution with 2 outliers as the boxplot show relatively the same distances from the median.

2.3.

s <- interp(data_scb$Young,data_scb$Adult,data_scb$Senior, duplicate = 'mean')

plot_ly(x=~s$x, y=~s$y, z=~s$z, type="surface", colorbar=list(title='Senior mean income')) %>% 
        layout(scene = list(xaxis = list(title = 'Young mean income'),
               yaxis = list(title = 'Adult mean income'),
               zaxis = list(title = 'Senior mean income')), showlegend = FALSE)

Surface plot showing the dependence of Senior in come on Adult and Young

Regions with higher mean income for seniors seem to have higher income for the the young and adult groups aswell, and vice versa for lower values.

There appears to be a strong positive correlation between the variables and therefore a linear regression would be appropriate. However there would probably be problems with multicollinearity if you use both young and adult in the same model to describe senior income. A linear regression could also be set up with mean income as the dependent variable(Y) and using the age-groups and regions as factors to describe the mean income.

2.4.

data_scb$region <- str_replace(data_scb$region,' county','')

data_scb$region <- gsub("[[:digit:] ]", "",data_scb$region )

data_scb$region[data_scb$region=='Örebro'] <- 'Orebro' # örebro is Orebro in json file..
# there is also numbers and spaces for each region so that will have to be taken care of for a possible merge

g=list(fitbounds="locations", visible=FALSE)
p<-plot_geo(data_scb)%>%add_trace(type="choropleth",geojson=data_json, locations=~region,
                            z=~Adult, featureidkey="properties.NAME_1",name="")%>%
  layout(geo=g, showlegend = FALSE)
p

Choropleth plot for the mean income of Adults

p<-plot_geo(data_scb)%>%add_trace(type="choropleth",geojson=data_json, locations=~region,
                                  z=~Young, featureidkey="properties.NAME_1",name="")%>%
  layout(geo=g, showlegend = FALSE)
p

Choropleth plot for the mean income of Young

Stockholmslän is the county with the highest mean income for both groups of ages, this is expected as it contains the capital and the biggest city in Sweden which generates the most money. Hallandslän also seem to stand out for both groups to have a relatively high mean income. So these two counties are the outliers that could be seen in the violin plot for these two groups.

The variance is lower for young people so to have the color switch with saturation will here create a more visual difference between the counties than for the adults even though the actual difference in mean income between the counties can be lower for the young group.

For the adults there seems to be a difference in the mean income dependent on the longitude of where the county is in Sweden, counties in the north seem to have a lower mean income than counties in the south of Sweden. This is not as obvious for the young group as Norrbottenslän in the far north is of a brighter color(higher mean income) than most counties in the middle of Sweden.

There seem to be no difference in mean income for counties dependent on whether they are in the east or west of Sweden.

2.5.

p<-plot_geo(data_scb)%>%add_trace(type="choropleth",geojson=data_json, locations=~region,
                                  z=~Young, featureidkey="properties.NAME_1",name="")%>%
  add_trace(lat = ~58.41109, lon = ~15.62565, type='markers',marker=list(color="red"), name='Linköping') %>% # changed the color
  layout(geo=g, showlegend = FALSE) # and name on the marker to red and Linköping!
p

Choropleth plot for the mean income of Young with Linköping marked

Now you can see Linköping on the map! Östergötland is one of the counties with the lowest mean income for young people.

Appendix

# setting graphs to be 1:1.5 in size in global options
knitr::opts_chunk$set(echo = TRUE, fig.width = 6,fig.height = 4, fig.align = 'center', warning = FALSE, message = FALSE)



# load necessary libraries
library(dplyr)
library(ggplot2)
library(plotly)
library(GGally)
library(MASS)
library(akima)
library(rjson)
library(geojsonio)
library(sf)
library(stringr)
library(tidyr)
df1 <- read.csv("aegypti_albopictus.csv")
# There seems to be errors of additional spacing for some of the "Less than x km" values in $LOCATION_TYPE.

df1 <- df1 %>%
  mutate(LOCATION_TYPE = replace(LOCATION_TYPE, LOCATION_TYPE == "Less than 10 km", "Less than 10km")) %>%
  mutate(LOCATION_TYPE = replace(LOCATION_TYPE, LOCATION_TYPE == "Less than 100 km", "Less than 100km")) %>%
  mutate(LOCATION_TYPE = replace(LOCATION_TYPE, LOCATION_TYPE == "Less than 25 km", "Less than 25km"))
Sys.setenv('MAPBOX_TOKEN' = "sk.eyJ1Ijoid291bnRhaW4iLCJhIjoiY2xtZzk5dmhiMWNubjNrcXFidzhlYmU4cyJ9.4bXukm7QheBqcyBIXPLchw")

p1 <- plot_mapbox(df1) %>% filter(YEAR == 2004) %>% add_trace(type = "scattermapbox", lat = ~Y, lon = ~X,
                                                              color = ~VECTOR) %>% layout(mapbox = list(style = "open-street-map"))

p1

p2 <- plot_mapbox(df1) %>% filter(YEAR == 2013) %>% add_trace(type = "scattermapbox", lat = ~Y, lon = ~X,
                                                              color = ~VECTOR) %>% layout(mapbox = list(style = "open-street-map"))

p2

# World Map projection
g12 <- list(
  projection = list(type = "equirectangular")
)
df12 <- df1 %>% group_by(COUNTRY) %>% summarise(Detections = n()) # new df which summarise the amount of observations per country

p3 <- plot_geo(df12) %>% add_trace(z = ~Detections,
                                   locations = ~COUNTRY,
                                   locationmode = "country names", # identify country names as locations
                                   colors = "Blues",
                                   text = ~COUNTRY) %>% layout(geo = g12, showlegend = FALSE)
p3

p4 <- plot_geo(df12) %>% add_trace(z = ~log(Detections),
                                   locations = ~COUNTRY,
                                   locationmode = "country names", # identify country names as locations
                                   color = ~Detections,
                                   colors = "Blues",
                                   text = ~COUNTRY) %>% layout(geo = g12, showlegend = FALSE)
p4
g13 <- list(
  projection = list(type = "conic equal area")
)

p5 <- plot_geo(df12) %>% add_trace(z = ~log(Detections),
                                   locations = ~COUNTRY,
                                   locationmode = "country names", # identify country names as locations
                                   color = ~Detections,
                                   colors = "Blues",
                                   text = ~COUNTRY) %>% layout(geo = g13, showlegend = FALSE)

p5

df14 <- df1 %>% dplyr::filter(.,COUNTRY == "Brazil" & YEAR == 2013)

## a)
df14$X1 <- cut_interval(df14$X, 100)

## b)
df14$Y1 <- cut_interval(df14$Y, 100)

## c)
df14 <- df14 %>% dplyr::group_by(X1, Y1) %>% summarise(mean_X = mean(X), mean_Y = mean(Y), N = n(), .groups = "keep")

## d)
p6 <- plot_mapbox(df14) %>% add_trace(type = "scattermapbox", 
                                      lat = ~mean_Y, lon = ~mean_X,
                                      color = ~N) %>% layout(mapbox = list(style = "open-street-map"), showlegend = FALSE)

p6

# changed the encoding to get the åäö in the names
data_scb <- read.csv("000006SW_20230912-130931.csv",fileEncoding = "ISO-8859-1")

data_json <- fromJSON(file="gadm41_SWE_1.json")

# splitting the age variable 

# 1
data_scb <- spread(data_scb, key=age, value=X2016)

colnames(data_scb)[2:4] <- c('Young', 'Adult', 'Senior')

knitr::kable(caption = "First 10 rows from the processed data", head(data_scb, n = 10))

# violin plot 
fig <- plot_ly(data_scb)
fig <- add_trace(fig, y=~Young,name='Young', type='violin', box=list(visible=T))
fig <- add_trace(fig, y=~Adult,name='Adult', type='violin', box=list(visible=T))
fig <- add_trace(fig, y=~Senior,name='Senior', type='violin', box=list(visible=T)) %>%
          layout(yaxis=list(title ='Mean income'))
fig 

s <- interp(data_scb$Young,data_scb$Adult,data_scb$Senior, duplicate = 'mean')

plot_ly(x=~s$x, y=~s$y, z=~s$z, type="surface", colorbar=list(title='Senior mean income')) %>% 
        layout(scene = list(xaxis = list(title = 'Young mean income'),
               yaxis = list(title = 'Adult mean income'),
               zaxis = list(title = 'Senior mean income')), showlegend = FALSE)

print(data_json$features[[11]]$properties)
# The county in the data frame are in English 'county' 
data_scb$region <- str_replace(data_scb$region,' county','')

data_scb$region <- gsub("[[:digit:] ]", "",data_scb$region )

data_scb$region[data_scb$region=='Örebro'] <- 'Orebro' # örebro is Orebro in json file..
# there is also numbers and spaces for each region so that will have to be taken care of for a possible merge

g=list(fitbounds="locations", visible=FALSE)
p<-plot_geo(data_scb)%>%add_trace(type="choropleth",geojson=data_json, locations=~region,
                            z=~Adult, featureidkey="properties.NAME_1",name="")%>%
  layout(geo=g, showlegend = FALSE)
p

p<-plot_geo(data_scb)%>%add_trace(type="choropleth",geojson=data_json, locations=~region,
                                  z=~Young, featureidkey="properties.NAME_1",name="")%>%
  layout(geo=g, showlegend = FALSE)
p
p<-plot_geo(data_scb)%>%add_trace(type="choropleth",geojson=data_json, locations=~region,
                                  z=~Young, featureidkey="properties.NAME_1",name="")%>%
  add_trace(lat = ~58.41109, lon = ~15.62565, type='markers',marker=list(color="red"), name='Linköping') %>% # changed the color
  layout(geo=g, showlegend = FALSE) # and name on the marker to red and Linköping!
p